NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Do Multi-Document Summarization Models Synthesize ?

https://doi.org/10.1162/tacl_a_00687

DeYoung, Jay; Martinez, Stephanie C; Marshall, Iain J; Wallace, Byron C (September 2024, Transactions of the Association for Computational Linguistics)
Louis, Annie (Ed.)
Abstract Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
more » « less
Full Text Available
Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs

Wadwha, Somin; DeYoung, Jay; Nye, Benjamin; Amir, Silvio; Wallace, Byron C (August 2024, Machine Learning for Healthcare (MLHC))

Full Text Available
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations

https://doi.org/10.18653/v1/2023.acl-long.549

Wang, Lucy Lu; Otmakhova, Yulia; DeYoung, Jay; Truong, Thinh Hung; Kuehl, Bailey; Bransom, Erin; Wallace, Byron (January 2023, Proceedings of the Association of Computational Linguistics (ACL))

Full Text Available
Overview of MSLR2022: A Shared Task on Multi-document Summarization for Literature Reviews

Wang, Lucy Lu; DeYoung, Jay; Wallace, Byron (January 2022, Proceedings of the Third Workshop on Scholarly Document Processing)

We provide an overview of the MSLR2022 shared task on multi-document summarization for literature reviews. The shared task was hosted at the Third Scholarly Document Processing (SDP) Workshop at COLING 2022. For this task, we provided data consisting of gold summaries extracted from review papers along with the groups of input abstracts that were synthesized into these summaries, split into two subtasks. In total, six teams participated, making 10 public submissions, 6 to the Cochrane subtask and 4 to the MSˆ2 subtask. The top scoring systems reported over 2 points ROUGE-L improvement on the Cochrane subtask, though performance improvements are not consistently reported across all automated evaluation metrics; qualitative examination of the results also suggests the inadequacy of current evaluation metrics for capturing factuality and consistency on this task. Significant work is needed to improve system performance, and more importantly, to develop better methods for automatically evaluating performance on this task.
more » « less
Full Text Available
Overview of MSLR2022: A Shared Task on Multi-document Summarization for Literature Reviews

Lu Wang, Lucy; DeYoung, Jay; Wallace, Byron (January 2022, Proceedings of the Third Workshop on Scholarly Document Processing at COLING)

We provide an overview of the MSLR2022 shared task on multi-document summarization for literature reviews. The shared task was hosted at the Third Scholarly Document Processing (SDP) Workshop at COLING 2022. For this task, we provided data consisting of gold summaries extracted from review papers along with the groups of input abstracts that were synthesized into these summaries, split into two subtasks. In total, six teams participated, making 10 public submissions, 6 to the Cochrane subtask and 4 to the MSˆ2 subtask. The top scoring systems reported over 2 points ROUGE-L improvement on the Cochrane subtask, though performance improvements are not consistently reported across all automated evaluation metrics; qualitative examination of the results also suggests the inadequacy of current evaluation metrics for capturing factuality and consistency on this task. Significant work is needed to improve system performance, and more importantly, to develop better methods for automatically evaluating performance on this task.
more » « less
Full Text Available
Evidence Inference 2.0: More Data, Better Models

DeYoung, Jay; Lehman, Eric; Nye, Ben; Marshall, Iain J.; Wallace, Byron C. (July 2020, BioNLP: Workshop on Biomedical Natural Language Processing)

Full Text Available
ERASER: A Benchmark to Evaluate Rationalized NLP Models

DeYoung, Jay; Jain, Sarthak; Rajani, Nazneen Fatema; Lehman, Eric; Xiong, Caiming; Socher, Richard; Wallace, Byron C. (January 2020, Transactions of the Association for Computational Linguistics)

Full Text Available
Inferring Which Medical TreatmentsWork from Reports of Clinical Trials

Lehman, Eric; DeYoung, Jay; Barzilay, Regina; Wallace, Byron C. (January 2019, Annual Conference of the North American Chapter of the Association for Computational Linguistics)

How do we know if a particular medical treatment actually works? Ideally one would consult all available evidence from relevant clinical trials. Unfortunately, such results are primarily disseminated in natural language scientific articles, imposing substantial burden on those trying to make sense of them. In this paper, we present a new task and corpus for making this unstructured evidence actionable. The task entails inferring reported findings from a full-text article describing a randomized controlled trial (RCT) with respect to a given intervention, comparator, and outcome of interest, e.g., inferring if an article provides evidence supporting the use of aspirin to reduce risk of stroke, as compared to placebo. We present a new corpus for this task comprising 10,000+ prompts coupled with fulltext articles describing RCTs. Results using a suite of models — ranging from heuristic (rule-based) approaches to attentive neural architectures — demonstrate the difficulty of the task, which we believe largely owes to the lengthy, technical input texts. To facilitate further work on this important, challenging problem we make the corpus, documentation, a website and leaderboard, and code for baselines and evaluation available at http: //evidence-inference.ebm-nlp.com/.
more » « less
Full Text Available

Search for: All records